In partially linear single-index models, we obtain the semiparametrically
efficient profile least-squares estimators of regression coefficients. We also
employ the smoothly clipped absolute deviation penalty (SCAD) approach to
simultaneously select variables and estimate regression coefficients. We show
that the resulting SCAD estimators are consistent and possess the oracle
property. Subsequently, we demonstrate that a proposed tuning parameter
selector, BIC, identifies the true model consistently. Finally, we develop a
linear hypothesis test for the parametric coefficients and a goodness-of-fit
test for the nonparametric component, respectively. Monte Carlo studies are
also presented.



1. Introduction. Regression analysis is commonly used to explore the
relationship between a response variable $Y$ and its covariates $Z$. For the
sake of convenience, one often employs a linear regression model
$E(Y|Z) = Z^T\alpha$ to assess the impact of the covariates on the response,
where $\alpha$ is an unknown vector and $E$ stands for "expectation." In
practice, however, the linear assumption may not be valid. Hence, it is natural
to consider a single-index model $E(Y|Z) = \eta(Z^T\alpha)$, in which the link
function $\eta$ is unknown. Accordingly, various parameter estimators of
single-index models have been proposed [e.g., Powell, Stock and Stoker (1989);
Duan and Li (1991); Härdle, Hall and Ichimura (1993); Ichimura (1993); Horowitz
and Härdle (1996); Liang and Wang (2005)]. Detailed discussion and illustration
of the usefulness of this model can be found in Horowitz (1998). Although the
single-index model plays an important role in data analysis, it may not be
sufficient to explain the variation of responses via covariates $Z$. Therefore,
Carroll et al. (1997) augmented the single-index model in a linear form with
additional covariates $X$, which yields a partially linear single-index model
(PLSIM), $E(Y|Z,X) = \eta(Z^T\alpha) + X^T\beta$. When $Z$ is scalar and
$\alpha = 1$, the PLSIM reduces to the partially linear model,
$E(Y|Z,X) = \eta(Z) + X^T\beta$ [see Speckman (1988)]. A comprehensive review
of partially linear models can be found in Härdle, Liang and Gao (2000).

To estimate parameters in partially linear single-index models, Carroll et
al. (1997) proposed the backfitting algorithm. However, the resulting
estimators may be unstable [see Yu and Ruppert (2002)] and undersmoothing the
nonparametric function is necessary to reduce the bias of the parametric
estimators. Accordingly, Yu and Ruppert (2002) proposed the penalized spline
estimation procedure, while Xia and Härdle (2006) applied the minimum average
variance estimation (MAVE) method, which was originally introduced by Xia et
al. (2002) for dimension reduction. Although Yu and Ruppert's procedure is
useful, it may not yield efficient estimators; that is, the asymptotic
covariance of their estimators does not reach the semiparametric efficiency
bound [Carroll et al. (1997)]. In addition, these estimators must be computed
via an iterative procedure that alternates between estimating the
nonparametric component and the parametric component. We therefore propose the
profile least-squares approach, which yields efficient estimators that attain
the semiparametric efficiency bound. Moreover, the resulting estimators can be
found without using the iterative procedure mentioned above and hence may
reduce the computational burden.

In data analysis, the true model is often unknown; this allows the pos- 
sibility of selecting an underfitted (or overfitted) model, leading to biased 
(or inefficient) estimators and predictions. To address this problem, Tibshi- 
rani (1996) introduced the least absolute shrinkage and selection operator 
(LASSO) to shrink the estimated coefficients of superfluous variables to zero 
in linear regression models. Subsequently, Fan and Li (2001) proposed the 
smoothly clipped absolute deviation (SCAD) approach that not only selects 
important variables consistently, but also produces parameter estimators as 
efficiently as if the true model were known, a property not possessed by 
the LASSO. Because it is not a simple matter to formulate the penalized
function via the MAVE procedure, we employ the profile least-squares approach
to obtain the SCAD estimators for both parameter vectors $\alpha$ and $\beta$.
Furthermore, we establish the asymptotic results of the SCAD estimators,
which include the consistency and oracle properties [see Fan and Li (2001)].
Simulation results are consistent with theoretical findings.

After estimating unknown parameters, it is natural to construct hypothesis
tests to assess the appropriateness of the linear constraint hypothesis as
well as the linearity of the nonparametric function. We demonstrate that the
resulting test statistics are asymptotically chi-square distributed under the
null hypothesis. In addition, simulation studies indicate that the test
statistics perform well.

The rest of this paper is organized as follows. Section 2 introduces the
profile least-squares estimators and the penalized SCAD estimators. The
asymptotic properties of these estimators are obtained. Section 3 presents
hypothesis tests and their large-sample properties. Monte Carlo studies are
presented in Section 4. Section 5 concludes the article with a brief
discussion. All detailed proofs are deferred to the Appendix.

2. Profile least-squares procedure. Suppose that
$\{(Y_i, Z_i, X_i), i = 1, \ldots, n\}$ is a random sample generated from the
PLSIM

$$Y = \eta(Z^T\alpha) + X^T\beta + \varepsilon, \tag{2.1}$$

where $Z$ and $X$ are $p$-dimensional and $q$-dimensional covariate vectors,
respectively, $\alpha = (\alpha_1, \ldots, \alpha_p)^T$,
$\beta = (\beta_1, \ldots, \beta_q)^T$, $\eta(\cdot)$ is an unknown
differentiable function, $\varepsilon$ is the random error with mean zero and
finite variance $\sigma^2$, and $(Z^T, X^T)^T$ and $\varepsilon$ are
independent. Furthermore, we assume that $\|\alpha\| = 1$ and that the first
element of $\alpha$ is positive, to ensure identifiability. We then employ the
profile least-squares procedure to obtain efficient estimators and SCAD
estimators in the following two subsections, respectively.

2.1. Profile least-squares estimator. In semiparametric models, Severini
and Wong (1992) applied the profile likelihood approach to estimate the
parametric component. We adapt this approach to estimate unknown parameters of
partially linear single-index models. To begin, we re-express model (2.1) as

$$Y_i^* = \eta(\Lambda_i) + \varepsilon_i, \tag{2.2}$$

where $Y_i^* = Y_i - X_i^T\beta$ and $\Lambda_i = Z_i^T\alpha$. Then, for a
given $\zeta = (\alpha^T, \beta^T)^T$, we employ the local linear regression
technique [Fan and Gijbels (1996)] to estimate $\eta$, that is, to minimize

$$\sum_{i=1}^{n} \{a + b(\Lambda_i - u) + X_i^T\beta - Y_i\}^2 K_h(\Lambda_i - u) \tag{2.3}$$

with respect to $a$, $b$, where $K_h(\cdot) = h^{-1}K(\cdot/h)$, $K(\cdot)$ is
a kernel function and $h$ is a bandwidth. Let $(\hat a, \hat b)$ be the
minimizer of (2.3). Solving the two normal equations of (2.3) gives

$$\hat\eta(u, \zeta) = \hat a = \frac{K_{10}(u,\zeta)K_{11}(u,\zeta) - K_{20}(u,\zeta)K_{01}(u,\zeta)}{K_{00}(u,\zeta)K_{20}(u,\zeta) - K_{10}^2(u,\zeta)},$$

where $K_{jl}(u, \zeta) = \sum_{i=1}^{n} K_h(Z_i^T\alpha - u)(Z_i^T\alpha - u)^j (X_i^T\beta - Y_i)^l$
for $j = 0, 1, 2$ and $l = 0, 1$. Subsequently, following Jennrich's (1969)
assumption (a), there exists a profile least-squares estimator
$\hat\zeta = (\hat\alpha^T, \hat\beta^T)^T$ obtained by minimizing the
following profile least-squares function:

$$Q(\zeta) = \sum_{i=1}^{n} \{Y_i - \hat\eta(Z_i^T\alpha, \zeta) - X_i^T\beta\}^2. \tag{2.4}$$

It is noteworthy that we apply the Newton-Raphson iterative method to
find the estimator $\hat\zeta$; this technique is distinct from the more
commonly used iterative procedure in partially linear single-index models,
which alternately updates estimates of the nonparametric component and the
parametric component obtained from their corresponding objective functions. In
addition, our proposed profile least-squares approach allows us to directly
introduce the penalized function, as given in the next subsection.
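
The local linear estimator of $\eta$ and the profile objective $Q(\zeta)$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Epanechnikov kernel, the helper names (`dot`, `eta_hat`, `profile_q`) and the list-based data layout are choices made here and do not come from the paper; the closed form used for $\hat a$ is the solution of the two normal equations of (2.3).

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def epanechnikov(t):
    """Epanechnikov kernel on (-1, 1); an illustrative choice of K."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def eta_hat(u, index, resid, h):
    """Local linear estimate of eta at u from the sums K_jl of (2.3).

    index[i] = Z_i' alpha and resid[i] = X_i' beta - Y_i.
    """
    k = {(j, l): 0.0 for j in range(3) for l in range(2)}
    for lam, r in zip(index, resid):
        d = lam - u
        w = epanechnikov(d / h) / h  # K_h(Lambda_i - u)
        for j in range(3):
            for l in range(2):
                k[(j, l)] += w * d ** j * r ** l
    denom = k[(0, 0)] * k[(2, 0)] - k[(1, 0)] ** 2
    return (k[(1, 0)] * k[(1, 1)] - k[(2, 0)] * k[(0, 1)]) / denom


def profile_q(alpha, beta, y, z, x, h):
    """Profile least-squares objective Q(zeta) for zeta = (alpha, beta)."""
    index = [dot(zi, alpha) for zi in z]
    resid = [dot(xi, beta) - yi for xi, yi in zip(x, y)]
    # Y_i - eta_hat(Lambda_i) - X_i' beta equals -resid_i - eta_hat(Lambda_i).
    return sum((-r - eta_hat(lam, index, resid, h)) ** 2
               for lam, r in zip(index, resid))
```

In practice `profile_q` would be handed to a numerical optimizer that respects the constraint $\|\alpha\| = 1$; that step, and any trimming of small denominators, is omitted from this sketch.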

To study the large-sample properties of parameter estimators, we consider
the true model with an unknown parameter vector
$\zeta_0 = (\alpha_0^T, \beta_0^T)^T$. In addition, we assume that
$\alpha_0^T$ and $\beta_0^T$ have the same dimensions as their corresponding
parameter vectors $\alpha^T$ and $\beta^T$ in the candidate model of this
subsection. Moreover, we introduce the following notation:
$A^{\otimes 2} = AA^T$ for a matrix $A$, $\Lambda = Z^T\alpha$,
$\hat\Lambda = Z^T\hat\alpha$, $\Lambda_0 = Z^T\alpha_0$,
$\tilde\xi = \xi - E(\xi|\Lambda)$,
$\hat\xi = \xi - \hat E(\xi|\hat\Lambda)$ and
$\tilde\xi_0 = \xi - E(\xi|\Lambda_0)$ for any random variable (or vector)
$\xi$, where $\hat E(\xi|\hat\Lambda)$ is the local linear estimator of
$E(\xi|\hat\Lambda)$. For example, $\tilde Z = Z - E(Z|\Lambda)$ and
$\tilde X = X - E(X|\Lambda)$. We next present six conditions and then obtain
the weak consistency and asymptotic normality of the profile least-squares
estimators.

(i) The function $\eta(\cdot)$ is differentiable and not constant on the
support $\mathcal{U}$ of $Z^T\alpha$.

(ii) The function $\eta(z^T\alpha)$ and the density function of $Z^T\alpha$,
$f_\alpha(z)$, are both three times continuously differentiable with respect
to $z$. The third derivatives are uniformly Lipschitz continuous over
$\mathcal{A} \subset \mathbb{R}^p$ for all
$u \in \{u = z^T\alpha : \alpha \in \mathcal{A}, z \in \mathcal{Z} \subset \mathbb{R}^p\}$.

(iii) $E(|Y|^{m_1}) < \infty$ for some $m_1 > 3$. The conditional variance of
$Y$ given $(X, Z)$ is bounded and bounded away from 0.

(iv) The kernel function $K(\cdot)$ is twice continuously differentiable with
the support $(-1, 1)$. In addition, its second derivative is Lipschitz
continuous. Moreover, $\int u^j K(u)\,du = 1$ if $j = 0$ and $= 0$ if $j = 1$.

(v) The bandwidth $h$ satisfies
$\{nh^{3+3/(m_1-1)}\}\log^{-1} n \to \infty$ and $nh^8 \to 0$ as
$n \to \infty$.

(vi) $E(\tilde X_0^{\otimes 2})$ and
$E\{\tilde Z_0\eta'(\Lambda_0)\}^{\otimes 2}$ are positive-definite, where
$\eta'$ is the first derivative of $\eta$.

Theorem 1. Under the regularity conditions (i)-(vi), with probability
tending to one, $\hat\zeta$ is a consistent estimator of $\zeta_0$.
Furthermore, $\sqrt{n}(\hat\zeta - \zeta_0) \to N(0, \sigma^2 D^{-1})$ in
distribution, where
$D = E[\{\eta'(\Lambda_0)\tilde Z_0^T, \tilde X_0^T\}^T]^{\otimes 2}$.
Moreover, $\hat\zeta$ is a semiparametrically efficient estimator.

Because $(Z^T, X^T)^T$ is independent of $\varepsilon$, we are able to
demonstrate that the asymptotic variance of Xia and Härdle's (2006) minimum
average variance estimator is $\sigma^2 D^{-1}$, which indicates that MAVE is
also an efficient estimator.

After having estimated $\alpha$ and $\beta$, we obtain the following
estimator of $\eta(u)$:

$$\hat\eta(u) = \hat\eta(u, \hat\zeta) = \frac{K_{10}(u,\hat\zeta)K_{11}(u,\hat\zeta) - K_{20}(u,\hat\zeta)K_{01}(u,\hat\zeta)}{K_{00}(u,\hat\zeta)K_{20}(u,\hat\zeta) - K_{10}^2(u,\hat\zeta)}.$$

If the density function $f_{\Lambda_0}$ of $\Lambda_0$ is positive and the
derivative of $E(\varepsilon^2|\Lambda_0 = u)$ exists, then we can further
demonstrate that
$(nh)^{1/2}\{\hat\eta(u) - \eta(u) - \tfrac{1}{2}k_2\eta''(u)h^2\}$ converges
to a normal distribution $N(0, \sigma_u^2)$, where
$\sigma_u^2 = f_{\Lambda_0}^{-1}(u)\int K^2(t)\,dt \times E(\varepsilon^2|\Lambda_0 = u)$,
$k_2 = \int K(t)t^2\,dt$ and $\eta''$ is the second derivative of $\eta$. It is
also noteworthy that one usually either introduces a trimming function or adds
a ridge parameter [see Seifert and Gasser (1996)] when the denominator of
$\hat\eta$ is close to zero.

Remark 1. Condition (v) indicates that Theorem 1 is applicable for
a reasonable range of bandwidths. Numerical studies confirm this: our results
remain stable across various bandwidths around the optimal bandwidth selected
by cross-validation, in particular when the sample size becomes large.

2.2. Penalized profile least-squares estimator. In practice, the true model
is often unknown a priori. An underfitted model can yield biased estimates
and predicted values, while an overfitted model can degrade the efficiency
of the parameter estimates and predictions. This motivates us to apply the
penalized least-squares approach to simultaneously estimate parameters and
select important variables. To this end, we consider a penalized profile
least-squares function

$$\mathcal{L}_P(\zeta) = \frac{1}{2}Q(\zeta) + n\sum_{j=1}^{p} p_{\lambda_{1j}}(|\alpha_j|) + n\sum_{k=1}^{q} p_{\lambda_{2k}}(|\beta_k|), \tag{2.5}$$

where $p_\lambda(\cdot)$ is a penalty function with a regularization parameter
$\lambda$. Throughout this paper, we allow different elements of $\alpha$ and
$\beta$ to have different penalty functions with different regularization
parameters. For the purpose of selecting $X$-variables only, we simply set
$p_{\lambda_{1j}}(\cdot) = 0$ and the resulting penalized profile
least-squares function becomes

$$\mathcal{L}_P(\zeta) = \frac{1}{2}Q(\zeta) + n\sum_{k=1}^{q} p_{\lambda_{2k}}(|\beta_k|). \tag{2.6}$$

Similarly, if we are only interested in selecting $Z$-variables, then we set
$p_{\lambda_{2k}}(\cdot) = 0$ so that

$$\mathcal{L}_P(\zeta) = \frac{1}{2}Q(\zeta) + n\sum_{j=1}^{p} p_{\lambda_{1j}}(|\alpha_j|). \tag{2.7}$$

There are various penalty functions available in the literature. To obtain
the oracle property of Fan and Li (2001), we adopt their SCAD penalty,
whose first derivative is

$$p'_\lambda(\theta) = \lambda\left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a - 1)\lambda} I(\theta > \lambda) \right\},$$

and where $p_\lambda(0) = 0$, $a = 3.7$ and $(t)_+ = tI\{t > 0\}$ is the hinge
loss function. For the given tuning parameters, we obtain the penalized
estimators by minimizing $\mathcal{L}_P(\alpha, \beta)$ with respect to
$\alpha$ and $\beta$. For the sake of simplicity, we denote the resulting
estimators by $\hat\alpha_{\lambda_1}$ and $\hat\beta_{\lambda_2}$.
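
As a concrete illustration of the SCAD penalty, the following Python sketch evaluates its first derivative and the penalty itself. The function names are hypothetical. The closed form of `scad_penalty` is obtained by integrating the derivative from zero with $p_\lambda(0) = 0$, so it is a derived convenience rather than a formula quoted from the text.

```python
def scad_derivative(theta, lam, a=3.7):
    """First derivative of the SCAD penalty at |theta|."""
    theta = abs(theta)
    if theta <= lam:
        return lam
    # lam * (a*lam - theta)_+ / ((a - 1) * lam) simplifies to the line below.
    return max(a * lam - theta, 0.0) / (a - 1.0)


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty, i.e. the integral of scad_derivative from 0 to |theta|."""
    theta = abs(theta)
    if theta <= lam:
        return lam * theta
    if theta <= a * lam:
        return (2.0 * a * lam * theta - theta ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam ** 2 / 2.0
```

Small coefficients are penalized at the constant rate $\lambda$, as with the LASSO, while coefficients larger than $a\lambda$ receive no additional shrinkage; this flat tail is what permits the oracle property discussed next.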

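In what follows, we study the theoretical properties of the penalized
profile least-squares estimators with the SCAD penalty. Without loss of
generality, it is assumed that the correct model has regression coefficients
$\alpha_0 = (\alpha_{10}^T, \alpha_{20}^T)^T$ and
$\beta_0 = (\beta_{10}^T, \beta_{20}^T)^T$, where $\alpha_{10}$ and
$\beta_{10}$ are $p_0 \times 1$ and $q_0 \times 1$ nonzero components of
$\alpha_0$ and $\beta_0$, respectively, and $\alpha_{20}$ and $\beta_{20}$ are
$(p - p_0) \times 1$ and $(q - q_0) \times 1$ vectors with zeros. In addition,
we define $Z_1$ and $X_1$ in such a way that they consist of the first $p_0$
and $q_0$ elements of $Z$ and $X$, respectively. We define $\tilde Z_1$ and
$\tilde X_1$ analogously. Finally, we use the following notation for
simplicity: $\Gamma_{X_1} = E(\tilde X_1^{\otimes 2})$,
$\Gamma_{Z_1X_1} = E\{\tilde Z_1\tilde X_1^T\eta'(\Lambda_0)\}$ and
$\Gamma_{Z_1} = E\{\tilde Z_1\eta'(\Lambda_0)\}^{\otimes 2}$.

Theorem 2. Under the regularity conditions (i)-(vi), if, for all $k$ and $j$,
$\lambda_{1j} \to 0$, $\sqrt{n}\lambda_{1j} \to \infty$, $\lambda_{2k} \to 0$
and $\sqrt{n}\lambda_{2k} \to \infty$ as $n \to \infty$, then, with
probability tending to one, the penalized estimators
$\hat\alpha_{\lambda_1} = (\hat\alpha_{1\lambda_1}^T, \hat\alpha_{2\lambda_1}^T)^T$
and
$\hat\beta_{\lambda_2} = (\hat\beta_{1\lambda_2}^T, \hat\beta_{2\lambda_2}^T)^T$
satisfy:

(a) $\hat\alpha_{2\lambda_1} = 0$ and $\hat\beta_{2\lambda_2} = 0$;

(b) $\sqrt{n}(\hat\alpha_{1\lambda_1} - \alpha_{10}) \to N\{0, \sigma^2(\Gamma_{Z_1} - \Gamma_{Z_1X_1}\Gamma_{X_1}^{-1}\Gamma_{Z_1X_1}^T)^{-1}\}$
and
$\sqrt{n}(\hat\beta_{1\lambda_2} - \beta_{10}) \to N\{0, \sigma^2(\Gamma_{X_1} - \Gamma_{Z_1X_1}^T\Gamma_{Z_1}^{-1}\Gamma_{Z_1X_1})^{-1}\}$
in distribution.

Analogous results can be established for those parameter estimators obtained
via the penalized functions (2.6) or (2.7).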

Kong and Xia (2007) developed a cross-validation-based variable selection
procedure in single-index models and observed the difference between the
popular leave-m-out cross-validation method in the variable selection of the
linear model and the single-index model. It is noteworthy that we need two
tuning parameters in (2.5), $\lambda_1$ and $\lambda_2$, imposed on the
single-index part and the linear part, respectively. They can be of the same
order in the large-sample sense, on the basis of the assumptions in Theorem 2,
although the distinction between these two tuning parameters in practice is
anticipated, as will be discussed below. Theorem 2 indicates that the proposed
variable selection procedure possesses the oracle property. However, this
attractive feature relies on the tuning parameters. To this end, we adopt
Wang, Li and Tsai's (2007) BIC selector to choose the regularization
parameters $\lambda_{1j}$ and $\lambda_{2k}$. Because it is computationally
expensive to minimize BIC, defined below, with respect to the
$(p + q)$-dimensional regularization parameters, we follow the approach of Fan
and Li (2004) to set $\lambda_{1j} = \lambda\,\mathrm{SE}(\hat\alpha_j^u)$ and
$\lambda_{2k} = \lambda\,\mathrm{SE}(\hat\beta_k^u)$, where $\lambda$ is the
tuning parameter, and $\mathrm{SE}(\hat\alpha_j^u)$ and
$\mathrm{SE}(\hat\beta_k^u)$ are the standard errors of the unpenalized
profile least-squares estimators of $\alpha_j$ and $\beta_k$, respectively,
for $j = 1, \ldots, p$ and $k = 1, \ldots, q$. Let the resulting SCAD
estimators be $\hat\alpha_\lambda$ and $\hat\beta_\lambda$. We then select
$\lambda$ by minimizing the objective function

$$\mathrm{BIC}(\lambda) = \log\{\mathrm{MSE}(\lambda)\} + \{\log(n)/n\}\,\mathrm{DF}_\lambda, \tag{2.8}$$

where
$\mathrm{MSE}(\lambda) = n^{-1}\sum_{i=1}^{n}\{Y_i - \hat\eta(Z_i^T\hat\alpha_\lambda) - X_i^T\hat\beta_\lambda\}^2$
and $\mathrm{DF}_\lambda$ is the number of nonzero coefficients of both
$\hat\alpha_\lambda$ and $\hat\beta_\lambda$. More specifically, we choose
$\lambda$ to be the minimizer among a set of grid points over the bounded
interval $[0, \lambda_{\max}]$, where $\lambda_{\max}/\sqrt{n} \to 0$ as
$n \to \infty$. The resulting optimal tuning parameter is denoted by
$\hat\lambda$. In practice, a plot of $\mathrm{BIC}(\lambda)$ against
$\lambda$ can be used to determine an appropriate $\lambda_{\max}$ to ensure
that $\mathrm{BIC}(\lambda)$ reaches its minimum around the middle of the
range of $\lambda$. The grid points for $\lambda$ are then taken to be evenly
distributed over $[0, \lambda_{\max}]$ so that they are chosen to be fine
enough to avoid multiple minimizers of $\mathrm{BIC}(\lambda)$. In our
numerical studies, the range for $\lambda$ and the number of grid points are
set in the same manner as those in Wang, Li and Tsai (2007) and Zhang, Li and
Tsai (2010). Based on our limited experience, the resulting estimate of
$\lambda$ is quite stable with respect to the number of grid points when they
are sufficiently fine.
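
The grid search over $\lambda$ can be sketched as follows. The callable `fit` is hypothetical: it stands for whatever routine returns $\mathrm{MSE}(\lambda)$ and $\mathrm{DF}_\lambda$ from a SCAD fit at a given $\lambda$, and the function names and default grid size are assumptions made for this illustration only.

```python
import math


def bic(mse, df, n):
    """BIC(lambda) as in (2.8): log(MSE) + (log(n) / n) * DF."""
    return math.log(mse) + math.log(n) / n * df


def select_lambda(fit, n, lam_max, num_grid=50):
    """Return the grid point in [0, lam_max] that minimizes BIC.

    fit(lam) is a user-supplied (hypothetical) function returning the pair
    (mse, df) of the penalized fit with tuning parameter lam.
    """
    grid = [lam_max * g / (num_grid - 1) for g in range(num_grid)]
    scored = []
    for lam in grid:
        mse, df = fit(lam)
        scored.append((bic(mse, df, n), lam))
    # Ties in BIC are broken in favor of the smaller lambda.
    return min(scored)[1]
```

Plotting the `scored` pairs before taking the minimum is one way to check that the minimizer lies in the interior of the grid, as recommended above.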

To investigate the theoretical properties of the BIC selector, we denote by
$S = \{j_1, \ldots, j_d\}$ the set of the indices of the covariates in the
given candidate model, which contains indices of both $X$ and $Z$. In
addition, let $S_T$ be the true model, $S_F$ be the full model and $S_\lambda$
be the set of the indices of the covariates selected by the SCAD procedure
with tuning parameter $\lambda$. For a given candidate model $S$ with
parameter vectors $\alpha_S$ and $\beta_S$, let $\hat\alpha_S$ and
$\hat\beta_S$ be the corresponding profile least-squares estimators. Then,
define
$\hat\sigma_n^2(S) = n^{-1}\sum_{i=1}^{n}\{Y_i - \hat\eta(Z_i^T\hat\alpha_S) - X_i^T\hat\beta_S\}^2$
and further assume that:

(A) for any $S \subset S_F$, $\hat\sigma_n^2(S) \to \sigma^2(S)$ in
probability for some $\sigma^2(S) > 0$;

(B) for any $S \not\supseteq S_T$, we have $\sigma^2(S) > \sigma^2(S_T)$.

It is noteworthy that (A) and (B) are the standard conditions for
investigating parameter estimation under model misspecification [e.g., see
Wang, Li and Tsai (2007)].

We next present the asymptotic property of the BIC-type tuning parameter
selector.

Theorem 3. Under conditions (A), (B) and the regularity conditions
(i)-(vi), we have

$$P(S_{\hat\lambda} = S_T) \to 1. \tag{2.9}$$

Theorem 3 demonstrates that the BIC tuning parameter selector enables
us to select the true model consistently.

3. Hypothesis tests. Applying the estimation method described in the 
previous section, we propose two hypothesis tests. The first is for general 
hypothesis testing of regression parameters and the second is for testing the 
nonparametric function. 

3.1. Testing parametric components. Consider the general linear hypothesis

$$H_0: A\zeta = \delta \quad \text{versus} \quad H_1: A\zeta \ne \delta, \tag{3.1}$$

where $A$ is a known $m \times (p + q)$ full-rank matrix and $\delta$ is an
$m \times 1$ vector. A simple example of (3.1) is to test whether some
elements of $\alpha$ and $\beta$ are zero; that is,

$$H_0: \alpha_{i_1} = \cdots = \alpha_{i_k} = 0 \text{ and } \beta_{j_1} = \cdots = \beta_{j_l} = 0$$

versus

$$H_1: \text{not all } \alpha_{i_1}, \ldots, \alpha_{i_k} \text{ and } \beta_{j_1}, \ldots, \beta_{j_l} \text{ are equal to } 0.$$

Under $H_0$ and $H_1$, let $\zeta_0 = (\alpha_0^T, \beta_0^T)^T$ and
$\zeta_1 = (\alpha_1^T, \beta_1^T)^T$ be the corresponding parameter vectors,
and let $\Omega_0$ and $\Omega_1$ be the parameter spaces of $\zeta_0$ and
$\zeta_1$, respectively. It is noteworthy that this is a slight abuse of
notation because $\zeta_0$ was previously used to denote the true value of
$\zeta$ in Section 2.1. Furthermore, define

$$Q(H_0) = \inf_{\Omega_0} Q(\zeta_0) = \sum_{i=1}^{n}\{Y_i - X_i^T\hat\beta_0 - \hat\eta(Z_i^T\hat\alpha_0, \hat\zeta_0)\}^2$$

and

$$Q(H_1) = \inf_{\Omega_1} Q(\zeta_1) = \sum_{i=1}^{n}\{Y_i - X_i^T\hat\beta_1 - \hat\eta(Z_i^T\hat\alpha_1, \hat\zeta_1)\}^2,$$

where $\{\hat\beta_0, \hat\zeta_0\}$ and $\{\hat\beta_1, \hat\zeta_1\}$ are
the profile least-squares estimators of $\{\beta_0, \zeta_0\}$ and
$\{\beta_1, \zeta_1\}$, respectively, and $\hat\eta$ is the nonparametric
estimator of $\eta$ obtained via (2.3). Subsequently, we propose a test
statistic,

$$T_1 = \frac{n\{Q(H_0) - Q(H_1)\}}{Q(H_1)},$$

and give its theoretical property below.






Theorem 4. Assume that the regularity conditions (i)-(vi) hold. Then:

(a) under $H_0$ in (3.1), $T_1 \to \chi_m^2$ in distribution;

(b) under $H_1$ in (3.1), $T_1$ converges to a noncentral chi-squared
distribution with $m$ degrees of freedom and noncentrality parameter
$\phi = \lim_{n \to \infty} n\sigma^{-2}(A\zeta - \delta)^T(AD^{-1}A^T)^{-1}(A\zeta - \delta)$,
where $D$ is defined as in Theorem 1.

Analogously, we are able to construct the Wald test,
$W_n = (A\hat\zeta - \delta)^T(A\hat D^{-1}A^T)^{-1}(A\hat\zeta - \delta)$,
and demonstrate that $W_n$ and $T_1$ have the same asymptotic distribution.
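
Once the two profile objectives have been minimized, the statistic $T_1$ and the resulting decision are simple to compute. The sketch below is illustrative: the function names are hypothetical, and the small table holds the standard upper 5% points of the central chi-squared distribution for $m = 1, \ldots, 5$, included only so that the example needs no external library.

```python
# Upper 5% critical values of the chi-squared distribution, m = 1..5.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def t1_statistic(q0, q1, n):
    """T_1 = n * {Q(H0) - Q(H1)} / Q(H1)."""
    return n * (q0 - q1) / q1


def reject_h0(q0, q1, n, m):
    """Reject H0 at the 5% level when T_1 exceeds the chi-squared(m) point."""
    return t1_statistic(q0, q1, n) > CHI2_CRITICAL_05[m]
```

For example, with $n = 200$, $Q(H_0) = 105$ and $Q(H_1) = 100$, the statistic equals 10, which exceeds the 5% point for $m = 4$ but not the one for $m = 5$.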

3.2. Testing the nonparametric component. The nonparametric estimate
of $\eta(\cdot)$ provides us with descriptive and graphical information for
exploratory data analysis. Using this information, it is possible to formulate
a parametric model that takes into account the features that emerged from the
preliminary analysis. To this end, we introduce a goodness-of-fit test to
assess the appropriateness of a proposed parametric model. Without loss of
generality, we consider a simple linear model under the null hypothesis.
Accordingly, the null and alternative hypotheses are given as follows:

$$H_0: \eta(u) = \theta_0 + \theta_1 u \quad \text{versus} \quad H_1: \eta(u) \ne \theta_0 + \theta_1 u \text{ for some } u, \tag{3.2}$$

where $\theta_0$ and $\theta_1$ are unknown constant parameters.

Under $H_1$, let $\hat\alpha$, $\hat\beta$ and $\hat\eta$ be the
corresponding profile least-squares and nonparametric estimators of $\alpha$,
$\beta$ and $\eta$, respectively. Under $H_0$, we use the same parametric
estimators $\hat\alpha$ and $\hat\beta$ as those obtained under $H_1$, while
the estimator of $\eta$ is $\tilde\eta(u) = \hat\theta_0 + \hat\theta_1 u$,
where $\hat\theta_0$ and $\hat\theta_1$ are the ordinary least-squares
estimators of $\theta_0$ and $\theta_1$, respectively, obtained by fitting
$Y_i - X_i^T\hat\beta$ versus $Z_i^T\hat\alpha$. The resulting residual sums
of squares under the null and alternative hypotheses are then

$$\mathrm{RSS}(H_0) = \sum_{i=1}^{n}\{Y_i - \tilde\eta(Z_i^T\hat\alpha) - X_i^T\hat\beta\}^2$$

and

$$\mathrm{RSS}(H_1) = \sum_{i=1}^{n}\{Y_i - \hat\eta(Z_i^T\hat\alpha) - X_i^T\hat\beta\}^2.$$

To test the null hypothesis, we consider the following generalized F-test:

$$T_2 = \frac{r_K}{2}\,\frac{n\{\mathrm{RSS}(H_0) - \mathrm{RSS}(H_1)\}}{\mathrm{RSS}(H_1)},$$

where
$r_K = \{K(0) - 0.5\int K^2(u)\,du\}\{\int\{K(u) - 0.5K * K(u)\}^2\,du\}^{-1}$
and $K * K$ denotes the convolution of $K$ with itself. The theoretical
property of $T_2$ is given below.

Theorem 5. Assume that the regularity conditions (i)-(vi) hold. Under
$H_0$ in (3.2), $T_2$ has an asymptotic $\chi^2$ distribution with
$\mathrm{df}_n$ degrees of freedom, where
$\mathrm{df}_n = r_K|\mathcal{U}|\{K(0) - 0.5\int K^2(u)\,du\}/h$,
$|\mathcal{U}|$ stands for the length of $\mathcal{U}$ and $\mathcal{U}$ is
defined in regularity condition (i).

The above theorem unveils the Wilks phenomenon for the partially linear
single-index model. Furthermore, we can obtain an analogous result when
the simple linear model under $H_0$ is replaced by a multiple regression
model.
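
The constants entering the generalized F-test depend only on the kernel and can be evaluated numerically. The following Python sketch does so for the Epanechnikov kernel, which is an illustrative choice made here; the helper names, the trapezoid rule and the grid sizes are likewise assumptions. The denominator of $r_K$ is taken with a squared integrand, which is the form under which a uniform kernel gives $r_K = 1.2$.

```python
def epanechnikov(t):
    """Epanechnikov kernel on [-1, 1]; an illustrative choice of K."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def trapezoid(f, lo, hi, m):
    """Composite trapezoid rule with m subintervals on [lo, hi]."""
    step = (hi - lo) / m
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, m):
        total += f(lo + i * step)
    return total * step


def c_k(kernel, m=200):
    """K(0) - 0.5 * int K^2(u) du for a kernel supported on [-1, 1]."""
    return kernel(0.0) - 0.5 * trapezoid(lambda u: kernel(u) ** 2, -1.0, 1.0, m)


def r_k(kernel, m=200):
    """r_K = c_K / int {K(t) - 0.5 (K*K)(t)}^2 dt, integrated over [-2, 2]."""
    def squared_gap(t):
        conv = trapezoid(lambda s: kernel(s) * kernel(t - s), -1.0, 1.0, m)
        return (kernel(t) - 0.5 * conv) ** 2
    return c_k(kernel, m) / trapezoid(squared_gap, -2.0, 2.0, 2 * m)


def t2_statistic(rss0, rss1, n, r):
    """T_2 = (r_K / 2) * n * {RSS(H0) - RSS(H1)} / RSS(H1)."""
    return 0.5 * r * n * (rss0 - rss1) / rss1


def df_n(kernel, support_length, h, m=200):
    """Degrees of freedom r_K * |U| * c_K / h from Theorem 5."""
    return r_k(kernel, m) * support_length * c_k(kernel, m) / h
```

Because $\mathrm{df}_n$ is proportional to $1/h$, the reference distribution changes with the bandwidth, so the same $h$ should be used when computing $T_2$ and $\mathrm{df}_n$.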

4. Simulation studies. In this section, we present four Monte Carlo stud- 
ies which evaluate the finite-sample performance of the proposed estimation 
and testing methods. The first two examples illustrate the performance of 
the profile least-squares estimator and the SCAD-based variable selection 
procedure proposed in Sections 2.1 and 2.2, respectively. The next two exam- 
ples explore the performance of the test statistics developed in Sections 3.1 
and 3.2. 

Example 4.1. We generated 500 realizations, each consisting of n =
50, 100 and 200 observations, from each of the following models:

$$y = 4\{(z_1 + z_2 - 1)/\sqrt{2}\}^2 + 4 + 0.2\varepsilon; \tag{4.1}$$

$$y = \sin[\{(z_1 + z_2 + z_3)/\sqrt{3} - a\}\pi/(b - a)] + \beta X + 0.1\varepsilon, \tag{4.2}$$

where $z_1, z_2, z_3$ are independent and uniformly distributed on [0,1],
$X = 0$ for the odd numbered observations and $X = 1$ for the even numbered
observations, $\varepsilon$ has the standard normal distribution,
$a = 0.3912$ and $b = 1.3409$. The resulting parameters of model (4.1) are
$\alpha = (\alpha_1, \alpha_2)^T = (0.7071, 0.7071)^T$, with
$\phi_0 = \arccos(\alpha_1) = 0.7854$ as in Xia and Härdle (2006) [or $\pi/4$
in Härdle, Hall and Ichimura (1993), page 165], and those of model (4.2) are
$(\alpha^T, \beta)^T = (\alpha_1, \alpha_2, \alpha_3, \beta)^T = (0.5774, 0.5774, 0.5774, 0.3)^T$.
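
A data generator for the two designs of Example 4.1 might look as follows. This is a sketch under stated assumptions: the function names, the tuple layout of each observation and the use of Python's `random.Random` are choices made here, and observations are numbered from 1 so that odd-numbered ones receive $X = 0$.

```python
import math
import random

A, B = 0.3912, 1.3409  # constants a and b of model (4.2)


def simulate_model_41(n, rng):
    """Draw n observations (y, z) from model (4.1)."""
    rows = []
    for _ in range(n):
        z = [rng.random(), rng.random()]
        index = (z[0] + z[1] - 1.0) / math.sqrt(2.0)
        y = 4.0 * index ** 2 + 4.0 + 0.2 * rng.gauss(0.0, 1.0)
        rows.append((y, z))
    return rows


def simulate_model_42(n, rng, beta=0.3, sigma=0.1):
    """Draw n observations (y, z, x) from model (4.2)."""
    rows = []
    for i in range(1, n + 1):
        z = [rng.random() for _ in range(3)]
        x = 0.0 if i % 2 == 1 else 1.0  # odd-numbered: 0, even-numbered: 1
        index = sum(z) / math.sqrt(3.0)
        y = (math.sin((index - A) * math.pi / (B - A))
             + beta * x + sigma * rng.gauss(0.0, 1.0))
        rows.append((y, z, x))
    return rows
```

A call such as `simulate_model_42(100, random.Random(0))` returns one realization; repeating it 500 times with different seeds mirrors the Monte Carlo design described above.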

Model (4.1) was analyzed by Härdle, Hall and Ichimura (1993) and Xia
and Härdle (2006), while model (4.2) was investigated by Carroll et al. (1997)
and Xia and Härdle (2006). For both models, Xia and Härdle claimed that
their MAVE approach outperforms those of Härdle, Hall and Ichimura (1993)
and Carroll et al. (1997), respectively. It is therefore of interest to compare
the profile least-squares (abbreviated to PrLS) method with MAVE. Tables 1
and 2 present the results for models (4.1) and (4.2), respectively. Both tables
indicate that the profile least-squares method yields accurate estimates and
that the mean squared error becomes smaller as the sample size gets larger,
which is consistent with the theoretical finding. Furthermore, Table 1 shows
that the mean squared errors of $\alpha$ and $\beta$ are smaller than those
computed via the MAVE method [see Table 1 of Xia and Härdle (2006)]. Table 2
suggests that the biases and their associated mean squared errors of $\alpha$
and $\beta$ are comparable to those calculated via MAVE. In summary, the Monte
Carlo studies indicate that PrLS performs well. Because the penalized MAVE is
not easy to obtain, we study only the penalized PrLS estimates in the
following example.

Table 1
Simulation results for Example 4.1: the profile least-squares estimates (PrLS)
and their corresponding mean squared errors ($\times 10^{-4}$) for model (4.1)

    n   $\alpha_1$ (= 0.7071)   $\alpha_2$ (= 0.7071)   $\phi_0$ (= 0.7854)
   50   0.7053 (21.5274)        0.7059 (21.3398)        0.7859 (42.9158)
  100   0.7054 (8.5874)         0.7076 (8.4175)         0.7869 (17.0116)
  200   0.7067 (4.4636)         0.7069 (4.4287)         0.7856 (8.8942)

Example 4.2. We simulated 500 realizations, each consisting of n = 100
and 200 random samples, from model (4.2) with $\sigma = 0.1$ and 0.25,
respectively. The mean function has coefficients
$\alpha = (1, 3, 1.5, 0.5, 0, 0, 0, 0)^T/\sqrt{12.5}$ and
$\beta = (3, 2, 0, 0, 0, 1.5, 0, 0.2, 0.3, 0.15, 0, 0)^T$. To assess the
robustness of estimates, we further generate the linear and nonlinear
covariates from the following three scenarios: (i) the covariate vectors $X$
and $Z$ have 12 and 8 elements, respectively, which are independent and
uniformly distributed on [0,1]; (ii) the covariate vector $X$ has 12 elements;
the first 5 and last 5 elements are independent and standard normally
distributed, while the 6th and 7th elements are independently Bernoulli
distributed with success probability 0.5; the covariate vector $Z$ has 8
elements, which are independent and standard normally distributed; (iii) a
covariate vector $W$ was generated from a 12-dimensional normal distribution
with mean 0 and variance 0.25; the correlation between $w_i$ and $w_j$ is
$\rho^{|i-j|}$ with $\rho = 0.4$; then the covariate vector
$X = W + \{1.5\exp(1.5z_1),$ 5z_[illegible], [illegible],
3z_1 + z_[illegible], $0, 0, 0, 0, 0, 0, 0, 0\}^T$; moreover, the covariate
vector $Z$ has 8 elements, which are independent and uniformly distributed on
[0, 1].

Table 2
Simulation results for Example 4.1: the profile least-squares estimates (PrLS)
and their corresponding mean squared errors ($\times 10^{-4}$) for model (4.2)

    n   $\alpha_1$ (= 0.5774)   $\alpha_2$ (= 0.5774)   $\alpha_3$ (= 0.5774)   $\beta$ (= 0.3)
   50   0.5753 (5.8336)         0.5771 (5.5685)         0.5782 (5.9245)         0.2923 (11.4582)
  100   0.5776 (2.5009)         0.5774 (2.4606)         0.5764 (2.4770)         0.3000 (4.7030)
  200   0.5782 (1.1533)         0.5771 (1.0852)         0.5764 (1.2483)         0.3004 (2.2026)

Based on the above model settings, we next explore the performance of
the penalized profile least-squares approach via SCAD-BIC. Because the
Akaike information criterion [Akaike (1973)] has been commonly used for
classical variable selections, we also study SCAD-AIC by replacing $\log(n)$
in (2.8) with $2(p + q)$. To assess the performance, we consider Fan and Li's
(2001) median of relative model error (MRME), where the relative model
error is defined as $\mathrm{RME} = \mathrm{ME}/\mathrm{ME}_{S_F}$,
$\mathrm{ME}$ is defined as $E(Z^T\hat\alpha_\lambda - Z^T\alpha)^2$ for
$\hat\alpha_\lambda$ and $E(X^T\hat\beta_\lambda - X^T\beta)^2$ for
$\hat\beta_\lambda$, and $\mathrm{ME}_{S_F}$ is the corresponding model error
calculated by fitting the data with the full model via the unpenalized
estimates. In addition, we calculate the average number of the true zero
coefficients that were correctly set to zero and the average number of the
truly nonzero coefficients that were incorrectly set to zero. Table 3 shows
that SCAD-BIC outperforms SCAD-AIC in terms of model error measures.
Moreover, SCAD-BIC has a much better rate of correctly identifying the true
submodel than that of SCAD-AIC, although it sometimes shrinks small nonzero
coefficients to zero. Unsurprisingly, SCAD-BIC improves as the signal gets
stronger and the sample size becomes larger, which corroborates our
theoretical findings.

Table 3
Simulation results for Example 4.2. S-AIC: SCAD(AIC); S-BIC: SCAD(BIC).
MRME: median of relative model error; C: the average number of the true zero
coefficients that were correctly set to zero; I: the average number of the
truly nonzero coefficients that were incorrectly set to zero

               $\sigma = 0.1$                                     $\sigma = 0.25$
               $\alpha$                 $\beta$                   $\alpha$                 $\beta$
    n  Method  MRME         C     I     MRME         C     I     MRME         C     I     MRME         C     I

Scenario (i)
  100  Oracle  [illegible]  4     0     0.28         6     0     0.2          4     0     [illegible]  6     0
       S-BIC   0.37         3.60  0.08  0.91         5.32  0.29  0.73         3.29  0.30  0.86         4.91  1.02
       S-AIC   0.66         3.08  0.05  0.97         4.12  0.15  0.75         2.70  0.11  0.91         4.02  0.68
  200  Oracle  0.27         4     0     0.34         6     0     0.31         4     0     0.39         6     0
       S-BIC   0.33         3.89  0.02  0.85         5.55  0.02  0.36         3.86  0.03  0.94         5.50  0.57
       S-AIC   0.60         3.39  0.01  0.92         4.49  0.01  0.62         3.29  0.01  0.93         4.43  0.23

Scenario (ii)
  100  Oracle  0.29         4     0     0.35         6     0     0.24         4     0     0.24         6     0
       S-BIC   0.36         3.75  0.05  0.88         5.44  0.19  0.66         3.47  0.27  0.94         5.11  1.07
       S-AIC   0.65         3.26  0.02  0.91         4.35  0.07  0.70         2.86  0.09  0.96         4.04  0.67
  200  Oracle  0.31         4     0     0.36         6     0     0.32         4     0     0.3          6     0
       S-BIC   0.36         3.91  0.01  0.79         5.64  0.01  0.40         3.87  0.03  0.85         5.53  0.50
       S-AIC   0.62         3.32  0     0.93         4.51  0.01  0.72         3.29  0.01  0.97         4.45  0.19

Scenario (iii)
  100  Oracle  0.28         4     0     0.15         6     0     0.19         4     0     0.18         6     0
       S-BIC   0.48         3.67  0.03  0.82         5.24  0.05  0.50         3.35  0.21  0.85         4.99  0.56
       S-AIC   0.74         3.09  0.02  0.94         4.35  0.03  0.71         2.70  0.12  0.98         4.17  0.32
  200  Oracle  0.32         4     0     0.18         6     0     0.29         4     0     0.17         6     0
       S-BIC   0.39         3.89  0     0.73         5.52  0.01  0.39         3.80  0.04  0.83         5.29  0.12
       S-AIC   0.68         3.30  0     0.84         4.54  0     0.66         3.13  0.02  0.85         4.48  0.03

Fig. 1. Power function of the test statistic $T_1$.
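
The summary measures reported in Table 3 can be computed from simulation output along the following lines. The function names and the representation of each fitted coefficient vector as a plain list are assumptions made for this sketch; the model errors themselves are taken as given inputs.

```python
import statistics


def selection_counts(estimate, truth):
    """Return (C, I) for one fit.

    C counts true zeros estimated as zero; I counts truly nonzero
    coefficients that were set to zero.
    """
    correct = sum(1 for e, t in zip(estimate, truth) if t == 0 and e == 0)
    incorrect = sum(1 for e, t in zip(estimate, truth) if t != 0 and e == 0)
    return correct, incorrect


def average_counts(estimates, truth):
    """Average C and I over a list of fitted coefficient vectors."""
    counts = [selection_counts(est, truth) for est in estimates]
    n_rep = len(counts)
    return (sum(c for c, _ in counts) / n_rep,
            sum(i for _, i in counts) / n_rep)


def median_relative_model_error(model_errors, full_model_errors):
    """MRME: median over replications of ME / ME of the full model."""
    ratios = [me / full for me, full in zip(model_errors, full_model_errors)]
    return statistics.median(ratios)
```

An oracle fit always yields $C$ equal to the number of true zeros and $I = 0$, which is why the Oracle rows of Table 3 show 4 and 6 in the C columns.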

Example 4.3. To study the finite-sample performance of the test statistic
$T_1$ in Section 3.1, we consider the same model as that of scenario (i) in
Example 4.2. Due to the model's parameter setting, we naturally consider
the following null and alternative hypotheses:

$$H_0: \beta_3 = \beta_4 = \beta_5 = \beta_7 = 0 \quad \text{versus} \quad H_1: \beta_3 = \beta_4 = \beta_5 = \beta_7 = c_1,$$

where $c_1$ ranges from 0 to 0.1 with increment 0.01 for $\sigma = 0.1$,
whereas $c_1$ takes values in the set $\{0, 0.01, 0.02, 0.09, 0.15, 0.2\}$ for
$\sigma = 0.25$. In addition, 500 realizations were generated with n = 200 to
calculate the size and power of $T_1$. Figure 1 depicts the power function
versus $c_1$. It shows that the empirical size at $c_1 = 0$ is very close to
the nominal level 0.05. Furthermore, the power of the test is greater than
0.95 as $c_1$ increases to 0.05 and 0.15, respectively, when $\sigma = 0.1$
and $\sigma = 0.25$. It is not surprising that the power increases as the
signal gets stronger. In summary, $T_1$ not only controls the size well, but
is also a powerful test.

Example 4.4. To examine the performance of the test statistic $T_2$ in
Section 3.2, we generated 500 realizations from the model given below with
$n = 200$:

(4.3)  $y = \eta\{(z_1 + z_2 + z_3)/\sqrt{3}\} - 0.5x_1 + 0.3x_2 + \sigma\varepsilon$,

where $\sigma = 0.1$ and $\sigma = 0.25$, respectively. We then consider the following
hypotheses:

$H_0: \eta(u) = u$ versus $H_1: \eta(u) = c_2 \sin\{\pi(u - a)/(b - a)\} + u$,

where $c_2$ ranges from 0 to 0.1 with increment 0.025 for $\sigma = 0.1$, while $c_2$
ranges from 0 to 0.2 with increment 0.05 for $\sigma = 0.25$. Figure 2 demonstrates
that the empirical size at $c_2 = 0$ is very close to the nominal level 0.05.
Furthermore, the power of the test is greater than 0.95 as $c_2$ increases to
0.075 and 0.2, respectively. As expected, the power increases when the signal
becomes stronger. Although $T_2$ is slightly less powerful than $T_1$, it controls
the size well and is a reliable test.

Fig. 2. Power function of the test statistic $T_2$.
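
The data-generating mechanism of model (4.3) under the alternative can be sketched as follows. This is an illustration only: the covariate and error distributions (standard normal here), the endpoints `a` and `b`, and the names `eta_alt` and `simulate_43` are assumptions that do not come from this section.

```python
import math
import random

def eta_alt(u, c2, a, b):
    """Link under H1: c2 * sin{pi (u - a)/(b - a)} + u; c2 = 0 gives H0."""
    return c2 * math.sin(math.pi * (u - a) / (b - a)) + u

def simulate_43(n, c2, sigma, a=0.0, b=1.0, seed=0):
    """Draw n observations (y, z, x) from model (4.3).

    ASSUMED: z, x and the error are standard normal; a and b are
    hypothetical endpoints of the index range.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x = [rng.gauss(0.0, 1.0) for _ in range(2)]
        index = sum(z) / math.sqrt(3.0)
        noise = sigma * rng.gauss(0.0, 1.0)
        y = eta_alt(index, c2, a, b) - 0.5 * x[0] + 0.3 * x[1] + noise
        data.append((y, z, x))
    return data

# One realization with n = 200, sigma = 0.1 and c2 = 0.075.
sample = simulate_43(n=200, c2=0.075, sigma=0.1)
```

Repeating the draw 500 times with different seeds and recording how often the test rejects would give one point of the power curve.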

5. Discussion. In partially linear single-index models, we propose using
the SCAD approach to shrink parameters contained in both parametric
and nonparametric components. The resulting estimators enjoy the oracle
property when the regularization parameters satisfy the proper conditions.
To further exploit SCAD, one could extend the current results to partially
linear multiple-index models by allowing $\varepsilon$ to be dependent on $(Z^T, X^T)^T$.
In addition, one could obtain the SCAD estimator for generalized partially
linear single-index models. Finally, an investigation of partially linear single-index
model selection with error-prone covariates could also be of interest.
We believe that these efforts would enhance the usefulness of SCAD in data
analysis.

APPENDIX: PROOFS OF THEOREMS

A.1. Proof of Theorem 1. Under the conditions of Theorem 1, we follow
arguments similar to those used by Ichimura (1993) and show that
$\hat\zeta = (\hat\alpha^T, \hat\beta^T)^T$ is a root-$n$ consistent estimator of $\zeta$. Because the proof is
straightforward, we do not present it here. We next demonstrate the asymptotic
normality of $\hat\zeta$ by using a general result of Newey (1994).

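Let $m_x(\Lambda) = E(X|\Lambda)$, $m_z(\Lambda) = E(Z|\Lambda)$ and $\kappa = \eta'(\Lambda)\{Z - m_z(\Lambda)\}$. In
addition, let

(A.1)  $\Psi(m_x, \eta, \kappa, \alpha, \beta, Y, Z, X) = (Y - \eta - X^T\beta) \begin{pmatrix} \kappa \\ X - m_x(\Lambda) \end{pmatrix}$.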



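For any given $m_x^*$, $\eta^*$ and $\kappa^*$, define

$D(m_x^* - m_x, \eta^* - \eta, \kappa^* - \kappa, \alpha, \beta, Y, Z, X) = \frac{\partial\Psi}{\partial m_x}(m_x^* - m_x) + \frac{\partial\Psi}{\partial\eta}(\eta^* - \eta) + \frac{\partial\Psi}{\partial\kappa}(\kappa^* - \kappa)$,

where the partial derivatives are the Frechet partial derivatives. After algebraic
simplification, we have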

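$\frac{\partial\Psi}{\partial m_x} = (Y - \eta - X^T\beta)\begin{pmatrix} 0 \\ -1 \end{pmatrix}$,  $\frac{\partial\Psi}{\partial\eta} = -\begin{pmatrix} \kappa \\ X - m_x(\Lambda) \end{pmatrix}$,  $\frac{\partial\Psi}{\partial\kappa} = (Y - \eta - X^T\beta)\begin{pmatrix} 1 \\ 0 \end{pmatrix}$,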



where the partial derivatives are zero. Accordingly, 

(A.2)  $\|\Psi(m_x^*, \eta^*, \kappa^*, \alpha, \beta, Y, Z, X) - \Psi(m_x, \eta, \kappa, \alpha, \beta, Y, Z, X) - D(m_x^* - m_x, \eta^* - \eta, \kappa^* - \kappa, \alpha, \beta, Y, Z, X)\| = O(\|m_x^* - m_x\|^2 + \|\eta^* - \eta\|^2 + \|\kappa^* - \kappa\|^2)$,

where $\|\cdot\|$ denotes the Sobolev norm, that is, the supremum norm of the
function itself, as well as its derivatives. Equation (A.2) is Newey's Assumption
5.1(i). It is also noteworthy that his Assumption 5.2 holds by the
expression of $D(\cdot, \cdot, \cdot, \alpha, \beta, Y, Z, X)$. Moreover, the result

$E\{D(m_x^* - m_x, \eta^* - \eta, \kappa^* - \kappa, \alpha, \beta, Y, Z, X)\} = 0$

leads to Newey's Assumption 5.3.

In addition to Newey's assumptions mentioned above, we need to verify
one more assumption before employing his result. To this end, we re-express
the solution of (2.2) as

$\hat\eta(u, \alpha, \beta) = \frac{1}{n}\sum_{i=1}^n \frac{\{s_2(u,h) - s_1(u,h)(\Lambda_i - u)\}K_h(\Lambda_i - u)(Y_i - X_i^T\beta)}{s_2(u,h)s_0(u,h) - s_1^2(u,h)}$,

where $\Lambda_i$ is the $i$th row of $\Lambda$ and

$s_r(u,h) = \frac{1}{n}\sum_{i=1}^n (\Lambda_i - u)^r K_h(\Lambda_i - u)$ for $r = 0, 1, 2$.

Then, let

$\hat m_x(u) = \frac{1}{n}\sum_{i=1}^n \frac{\{s_2(u,h) - s_1(u,h)(\Lambda_i - u)\}K_h(\Lambda_i - u)X_i}{s_2(u,h)s_0(u,h) - s_1^2(u,h)}$,

$\hat m_z(u) = \frac{1}{n}\sum_{i=1}^n \frac{\{s_2(u,h) - s_1(u,h)(\Lambda_i - u)\}K_h(\Lambda_i - u)Z_i}{s_2(u,h)s_0(u,h) - s_1^2(u,h)}$.

Applying techniques similar to those used in Mack and Silverman (1982),
we obtain the following equations, which hold uniformly in $u \in \mathcal{U}$:

(A.3)  $\hat\eta(u) - \eta(u) = o_P(n^{-1/4})$, $\hat\eta'(u) - \eta'(u) = o_P(n^{-1/4})$, $\hat m_x(u) - m_x(u) = o_P(n^{-1/4})$ and $\hat m_z(u) - m_z(u) = o_P(n^{-1/4})$.

These results imply that $\hat\kappa - \kappa = o_P(n^{-1/4})$. Thus, Newey's Assumption 5.1(ii)
holds.
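
The local linear smoother built from the moments $s_r(u, h)$ can be sketched in a few lines. This is an illustrative implementation only: the Epanechnikov kernel, the bandwidth, and the names `epanechnikov` and `local_linear` are assumptions, not choices stated in the paper. Passing $Y_i - X_i^T\beta$ as the response gives a fit of the link function, while passing a component of $X_i$ or $Z_i$ gives the corresponding conditional mean estimate.

```python
def epanechnikov(t):
    """Epanechnikov kernel (an ASSUMED choice of K)."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def local_linear(u, lam, resp, h):
    """Local linear fit at u based on the moments s_0, s_1, s_2.

    lam  : list of index values Lambda_i
    resp : list of responses to be smoothed against Lambda_i
    h    : bandwidth
    """
    n = len(lam)
    # K_h(Lambda_i - u) = K((Lambda_i - u) / h) / h
    k = [epanechnikov((li - u) / h) / h for li in lam]
    s = []
    for r in range(3):
        s.append(sum((li - u) ** r * ki for li, ki in zip(lam, k)) / n)
    denom = s[2] * s[0] - s[1] ** 2
    num = sum((s[2] - s[1] * (li - u)) * ki * ri
              for li, ki, ri in zip(lam, k, resp)) / n
    return num / denom

# A linear response is reproduced exactly by a local linear fit.
lam = [i / 50.0 for i in range(51)]
resp = [2.0 + 3.0 * li for li in lam]
fit = local_linear(0.4, lam, resp, h=0.2)
```

The weights sum to one and are orthogonal to $\Lambda_i - u$, which is why the fitted value at 0.4 coincides with $2 + 3 \times 0.4$ up to rounding.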

After examining Newey's Assumptions 5.1-5.3, we apply his Lemma 5.1
and find that $\hat\zeta$ has the same limit distribution as the solution to the equation

(A.4)  $0 = \sum_{i=1}^n \Psi(m_x, \eta, \kappa, \alpha, \beta, Y_i, Z_i, X_i)$.

Furthermore, it is easy to show that the solution to (A.4) has the same limit
distribution as described in the statement of Theorem 1. Hence, we complete
the proof of asymptotic normality.

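Finally, we show the efficiency of $\hat\zeta$. Let $p_\varepsilon(\varepsilon)$ be the probability density
function of $\varepsilon$ and let $p'_\varepsilon(\varepsilon)$ be its first-order derivative with respect to $\varepsilon$.
Then, the score function of $(\alpha, \beta)$ is

$S_{(\alpha,\beta)} = -\frac{p'_\varepsilon(\varepsilon)}{p_\varepsilon(\varepsilon)}\begin{pmatrix} \eta'(\Lambda)Z \\ X \end{pmatrix}$.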

For any given function $g$ of $(Z, X)$, it can be shown that the nuisance tangent
space $\mathcal{V}$, for the three nuisance parameters, $p_{(Z,X)}(z, x)$, $p_\varepsilon(\varepsilon)$ and $\eta(\Lambda)$,
is $\{g(Z,X): E(g) = 0, E(\varepsilon g)$ is a function of $(Z,X)$ only$\}$. Furthermore, the
orthogonal component of $\mathcal{V}$ is

$\mathcal{V}^{\perp} = \{\varepsilon g(Z,X): E(g|\Lambda) = 0\}$.

Subsequently, we apply the approach of Bickel et al. (1993) and obtain the 
following semiparametric efficient score function via equation (2.4): 

(A.5) = 

It can be seen that $S_{\mathrm{eff}} \in \mathcal{V}^{\perp}$.

For any $\varepsilon g \in \mathcal{V}^{\perp}$, we have $E(g|\Lambda) = 0$. Accordingly,

£{(S (ai/3) -S cff ) T e 5 (Z,X)} 

a fi/(A)Z] T fep'M 



" ( °[l X ) g - E \ E(X\A) ) 9 

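Because $E\{\varepsilon p'_\varepsilon(\varepsilon)/p_\varepsilon(\varepsilon)\} = -1$, it follows that

$E\{(S_{(\alpha,\beta)} - S_{\mathrm{eff}})^T \varepsilon g(Z,X)\} = E[\{\eta'(\Lambda)E(Z^T|\Lambda), E(X^T|\Lambda)\}E(g|\Lambda)] = 0$.

That is, $S_{\mathrm{eff}}$ is the projection of $S_{(\alpha,\beta)}$ onto $\mathcal{V}^{\perp}$ and the estimator $\hat\zeta$ is therefore
efficient [see Bickel et al. (1993)]. We have thus completed the proof.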

A.2. Proof of Theorem 2. To prove this theorem, we consider the following
three steps: Step I establishes the order of the minimizer $(\hat\alpha_{\lambda_1}^T, \hat\beta_{\lambda_2}^T)^T$ of
$\mathcal{L}_P(\alpha, \beta)$; Step II shows that $(\hat\alpha_{\lambda_1}^T, \hat\beta_{\lambda_2}^T)^T$ attains sparsity; Step III derives
the asymptotic distribution of the penalized estimators.

Step I. Let $\gamma_n = n^{-1/2} + a_n + c_n$, $v_1 = (v_{11}, \ldots, v_{1p})^T$, $v_2 = (v_{21}, \ldots, v_{2q})^T$
and $\|v_1\| = \|v_2\| = C$ for some positive constant $C$, where

$a_n = \max_{1 \le j \le p}\{|p'_{\lambda_{1j}}(|\alpha_{0j}|)|, \alpha_{0j} \ne 0\}$,

$c_n = \max_{1 \le k \le q}\{|p'_{\lambda_{2k}}(|\beta_{0k}|)|, \beta_{0k} \ne 0\}$,

and $\alpha_{0j}$ and $\beta_{0k}$ are the $j$th and $k$th elements of $\alpha_0$ and $\beta_0$, respectively,
for $j = 1, \ldots, p$ and $k = 1, \ldots, q$. Furthermore, define

$D_{n,1} = \sum_{i=1}^n \{Y_i - \hat\eta(Z_i^T(\alpha_0 + \gamma_n v_1), \alpha_0 + \gamma_n v_1, \beta_0 + \gamma_n v_2) - X_i^T(\beta_0 + \gamma_n v_2)\}^2 - \sum_{i=1}^n \{Y_i - \hat\eta(Z_i^T\alpha_0, \alpha_0, \beta_0) - X_i^T\beta_0\}^2$

and

$D_{n,2} = -n\sum_{j=1}^{p_0}\{p_{\lambda_{1j}}(|\alpha_{0j} + \gamma_n v_{1j}|) - p_{\lambda_{1j}}(|\alpha_{0j}|)\} - n\sum_{k=1}^{q_0}\{p_{\lambda_{2k}}(|\beta_{0k} + \gamma_n v_{2k}|) - p_{\lambda_{2k}}(|\beta_{0k}|)\}$.

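After algebraic simplification, we have

(A.6)  $D_{n,1} = \sum_{i=1}^n \{\hat\eta(Z_i^T(\alpha_0 + \gamma_n v_1), \alpha_0 + \gamma_n v_1, \beta_0 + \gamma_n v_2) - \hat\eta(Z_i^T\alpha_0, \alpha_0, \beta_0) + X_i^T\gamma_n v_2\}$
$\quad\times \{\hat\eta(Z_i^T(\alpha_0 + \gamma_n v_1), \alpha_0 + \gamma_n v_1, \beta_0 + \gamma_n v_2) + \hat\eta(Z_i^T\alpha_0, \alpha_0, \beta_0) + X_i^T(\beta_0 + \gamma_n v_2) + X_i^T\beta_0 - 2Y_i\}$
$= \sum_{i=1}^n \{\eta'(\Lambda_i)\tilde Z_i^T v_1\gamma_n + \tilde X_i^T v_2\gamma_n\}^2 - 2\sum_{i=1}^n \{\eta'(\Lambda_i)\tilde Z_i^T v_1\gamma_n + \tilde X_i^T v_2\gamma_n\}\varepsilon_i + o_P(1)$.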

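Moreover, applying the Taylor expansion and the Cauchy-Schwarz inequality,
we are able to show that $n^{-1}D_{n,2}$ is bounded by

$\sqrt{p_0}\gamma_n a_n\|v_1\| + \gamma_n^2 b_n\|v_1\|^2 + \sqrt{q_0}\gamma_n c_n\|v_2\| + \gamma_n^2 d_n\|v_2\|^2 \le C\gamma_n^2(\sqrt{p_0} + b_n C + \sqrt{q_0} + d_n C)$,

where

$b_n = \max_{1 \le j \le p}\{|p''_{\lambda_{1j}}(|\alpha_{0j}|)|, \alpha_{0j} \ne 0\}$, $d_n = \max_{1 \le k \le q}\{|p''_{\lambda_{2k}}(|\beta_{0k}|)|, \beta_{0k} \ne 0\}$.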

When $b_n$ and $d_n$ tend to 0 and $C$ is sufficiently large, the first term on
the right-hand side of (A.6) dominates the second term on the right-hand
side of (A.6) and $D_{n,2}$. As a result, for any given $\nu > 0$, there exists a large
constant $C$ such that

$P\{\inf_{\mathcal{V}_{12}} \mathcal{L}_P(\alpha_0 + \gamma_n v_1, \beta_0 + \gamma_n v_2) > \mathcal{L}_P(\alpha_0, \beta_0)\} \ge 1 - \nu$,

where $\mathcal{V}_{12} = \{(v_1, v_2): \|v_1\| = C, \|v_2\| = C\}$. We therefore conclude that the
rate of convergence of $(\hat\alpha_{\lambda_1}^T, \hat\beta_{\lambda_2}^T)^T$ is $O_P(n^{-1/2} + a_n + c_n)$.
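
The role of $a_n$ and $c_n$ in this rate can be illustrated numerically with the SCAD penalty of Fan and Li (2001), whose derivative is $p'_\lambda(\theta) = \lambda\{I(\theta \le \lambda) + (a\lambda - \theta)_+ / ((a - 1)\lambda)\,I(\theta > \lambda)\}$ for $\theta > 0$. The sketch below is an illustration under simplifying assumptions: a single tuning parameter is used for all coefficients, $a = 3.7$ is the conventional default, and the coefficient vector and function names are hypothetical.

```python
def scad_derivative(theta, lam, a=3.7):
    """First derivative p'_lambda(theta) of the SCAD penalty for theta >= 0."""
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1.0)

def max_penalty_slope(coefs, lam):
    """a_n-type quantity: max |p'_lambda(|coef|)| over the nonzero coefficients."""
    return max(abs(scad_derivative(abs(c), lam)) for c in coefs if c != 0.0)

# Hypothetical true coefficients; the smallest nonzero entry is 0.5.
alpha0 = [1.0, 3.0, 1.5, 0.5, 0.0, 0.0]

# When a * lam is below the smallest nonzero |coefficient|, the slope vanishes.
a_n_small_lam = max_penalty_slope(alpha0, lam=0.1)
# A larger lam leaves a nonzero slope at the smaller coefficients.
a_n_large_lam = max_penalty_slope(alpha0, lam=0.4)
```

In this toy setting the quantity is exactly zero once $a\lambda$ falls below the smallest nonzero coefficient, in which case the rate $n^{-1/2} + a_n + c_n$ reduces to $n^{-1/2}$.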

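Step II. Let $\alpha_1$ and $\beta_1$ satisfy $\|\alpha_1 - \alpha_{10}\| = O_P(n^{-1/2})$ and $\|\beta_1 - \beta_{10}\| = O_P(n^{-1/2})$,
respectively. We next show that

(A.7)  $\mathcal{L}_P\{(\alpha_1^T, 0^T)^T, (\beta_1^T, 0^T)^T\} = \min_{\mathcal{C}} \mathcal{L}_P\{(\alpha_1^T, \alpha_2^T)^T, (\beta_1^T, \beta_2^T)^T\}$,

where $\mathcal{C} = \{\|\alpha_2\| \le C^* n^{-1/2}, \|\beta_2\| \le C^* n^{-1/2}\}$ and $C^*$ is a positive constant.

Consider $\beta_k \in (-C^* n^{-1/2}, C^* n^{-1/2})$ for $k = q_0 + 1, \ldots, q$. When $\beta_k \ne 0$, we
have $\partial\mathcal{L}_P(\alpha, \beta)/\partial\beta_k = \ell_k(\alpha, \beta) + n p'_{\lambda_{2k}}(|\beta_k|)\,\mathrm{sgn}(\beta_k)$, where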

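$\ell_k(\alpha, \beta) = -\sum_{i=1}^n \{Y_i - \hat\eta(Z_i^T\alpha, \alpha, \beta) - X_i^T\beta\}\left\{X_{ik} + \frac{\partial\hat\eta(Z_i^T\alpha, \alpha, \beta)}{\partial\beta_k}\right\}$
$= \sum_{i=1}^n \{\hat\eta(Z_i^T\alpha, \alpha, \beta) - \eta(Z_i^T\alpha_0) + X_i^T(\beta - \beta_0) - \varepsilon_i\}\left\{X_{ik} + \frac{\partial\hat\eta(Z_i^T\alpha, \alpha, \beta)}{\partial\beta_k}\right\}$.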

Applying arguments similar to those used in the proof of Theorem 5.2 of 
Ichimura (1993), together with algebraic simplifications, the above term can 
be expressed as 

1 n 1 n 

- V(a - a ) T ^r ? / (A i )X jfc + - V(/3 - XiX ik + o^n" 1 / 2 ). 
n n 

i=l i=l 

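Using the assumptions that $\|\alpha - \alpha_0\| = O_P(n^{-1/2})$ and $\|\beta - \beta_0\| = O_P(n^{-1/2})$,
we have that $n^{-1}\ell_k(\alpha, \beta)$ is of the order $O_P(n^{-1/2})$. Therefore,

$\partial\mathcal{L}_P(\alpha, \beta)/\partial\beta_k = n\lambda_{2k}\{\lambda_{2k}^{-1}p'_{\lambda_{2k}}(|\beta_k|)\,\mathrm{sgn}(\beta_k) + O_P(n^{-1/2}/\lambda_{2k})\}$.

Because $\liminf_{n\to\infty}\liminf_{\beta_k\to 0+}\lambda_{2k}^{-1}p'_{\lambda_{2k}}(|\beta_k|) > 0$ and $n^{-1/2}/\lambda_{2k} \to 0$,
the sign of $\partial\mathcal{L}_P(\alpha, \beta)/\partial\beta_k$ is determined by that of $\beta_k$ for $\beta_k \in (-C^* n^{-1/2}, C^* n^{-1/2})$,
so that $\mathcal{L}_P$ decreases as $\beta_k$ moves toward zero from either side.
Analogously, we can show that the sign of $\partial\mathcal{L}_P(\alpha, \beta)/\partial\alpha_j$ is determined by that of $\alpha_j$
when $\alpha_j \in (-C^* n^{-1/2}, C^* n^{-1/2})$ for $j = p_0 + 1, \ldots, p$. Consequently, the minimum
is attained at $\alpha_2 = 0$ and $\beta_2 = 0$. This completes the proof of (A.7).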

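Step III. Finally, we demonstrate the asymptotic normality of $\hat\alpha_{\lambda_1 1}$ and
$\hat\beta_{\lambda_2 1}$. For the sake of simplicity, we define

$R_{\lambda_1} = \{p'_{\lambda_{11}}(|\alpha_{01}|)\,\mathrm{sgn}(\alpha_{01}), \ldots, p'_{\lambda_{1p_0}}(|\alpha_{0p_0}|)\,\mathrm{sgn}(\alpha_{0p_0})\}^T$,

$\Sigma_{\lambda_1} = \mathrm{diag}\{p''_{\lambda_{11}}(|\alpha_{01}|), \ldots, p''_{\lambda_{1p_0}}(|\alpha_{0p_0}|)\}$,

$R_{\lambda_2} = \{p'_{\lambda_{21}}(|\beta_{01}|)\,\mathrm{sgn}(\beta_{01}), \ldots, p'_{\lambda_{2q_0}}(|\beta_{0q_0}|)\,\mathrm{sgn}(\beta_{0q_0})\}^T$,

$\Sigma_{\lambda_2} = \mathrm{diag}\{p''_{\lambda_{21}}(|\beta_{01}|), \ldots, p''_{\lambda_{2q_0}}(|\beta_{0q_0}|)\}$,

where $\alpha_{0j}$ and $\beta_{0k}$ are the $j$th and $k$th elements of $\alpha_{10}$ and $\beta_{10}$, respectively,
for $j = 1, \ldots, p_0$ and $k = 1, \ldots, q_0$. It follows from (2.5) that $\hat\alpha_{\lambda_1 1}$ and $\hat\beta_{\lambda_2 1}$ satisfy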
( dC P (a Xl i,(3 x , 



0= < 



dcti 

dCp{a. Xl i,P\ 2l ) 



(A.8) 



where 



df3 l 
l(a Xll ,f3 X2l ) + 



R-Ai - S Ai(«Ail - a lo) 

H-A 2 - s a 2 (/3a 2 i - P10) 



1 n 



i=i 



^(A^Cai) 



3a 

9/3 



Aj and Xn here being the ith rows of A and X±, respectively, and C A1 = 
(okJiI' ^a 2 i) T being the penalized least-squares estimator of C\ = ( a i , Pj) T ■ 
Applying the Taylor expansion, we obtain 



1 n 

l(& Xll ,(3 X2l ) = - ^{Yt, - f)(Ai, Ci) - X?M < 

^(A i5 Ci) 



i=l 



dati 

dfj(Ai,C 



1 

--E 



i=l 



da.\ 
df,(Ai,C 



0/3, 
(Cai-Ci) 



U + Xi 



8(3, 



1 -+ + x ia 



- i £{y< - ^(Ai, Ci) - x?M ** K ^\ cxi - d), 



i=l 



where /3 1? Aj and are the interior points between (3, and (3 X2 i, Aj and 
Aj, and ^ 1 and £ A1 , respectively. Furthermore, using arguments similar to 
those used in the proof of Theorem 1, we have that 



l(a Xl i,(3 X2l ) 



1 



i=l 



n E 



^iV(Ai) I 
*M J 



r- ( "Ail - "10 \ 
Vn . + 

V/3a 2 i-/3io/ 



o p (l). 
Moreover, the summand of the matrix over n in the second term of the above 
equation converges to ZlZl XlZl . These results, together with (A. 8), 

\r~~ r r, -r; / 



lead to 



X 1 Z 1 



1/2 [ F Z 1 Z 1 + \ /&Ail-<*l(A _ 1/2 / R Ax 



It follows that 

nl/2 ( r z 1 z 1 + SaJC^i - «io) + « 1/2 r M 03 A2l - /3i ) - n 1/2 RA! 

1 n ~ 
= ^y"^,i ? ?'(Ai)ei + o p (l) 

V n ^ 

1=1 

and 

n 1 / 2 !^- (a All - aio) + n 1/2 (r M + £a 2 )0a 2 i - (3 W ) - n x / 2 R A2 

1 " ~ 
= -^=y^X it iei + o p {l). 

After simplification, we have 

>/"{( r z!Zi + s a x ) - ^^(r^^ + EAaJ-^^XaAu - aw) 
(A.9) +n 1 / 2 {r jfi ^ i (r^ i +S Al )- 1 R Al -R A2 } 
1 n ~ 

= Tm E^^V(Ai) - r^cr^ + Ea,)- 1 ^,!^ + 0p (i) 

i=l 

and 

^{(^x, + s a 2 ) - ^(T&fc + SaJ-^^^K^i - /3i ) 
(A.10) +n 1 / 2 {r| i ~ i (r^^ i +E Al )- 1 R A2 -R Al } 



1 n „ 

-= ^{a,i - ri ~ (r^ + e^j-^VCA,)}^ + o p 



(i)- 



Equations (A.9) and (A.10), together with the central limit theorem, yield
that

Vn^z^ + s Ai - r M (r M + s A2 )- 1 r| i ~ i }(a Al i - « 10 ) 

+ n 1 / 2 {r^^(r M +E A2 )" 1 R Al -R A2 }^iV(0,E ai ) 
and 

v^{r M + s A2 - i^Cr&fc + s Al )- 1 r^ i ^ i }0 A2l - /3 10 ) 
+ n 1 /2 { rJ i ~ i (r^ i ^ i + s Al )- 1 R A2 -R Al }^iv(o,s / 3 i ), 



where 



and 

+ r l 1 z 1 ( r z 1 z 1 + s Ai)~ lr z 1 z 1 ( r z 1 z 1 + £ Al )- 1 r jfi ^ i }(7 2 . 

Because each element of $n^{1/2}\Sigma_{\lambda_1}$, $n^{1/2}\Sigma_{\lambda_2}$, $n^{1/2}R_{\lambda_1}$ and $n^{1/2}R_{\lambda_2}$ tends to
zero, we complete the proof.

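A.3. Proof of Theorem 3. Let $r_n = \log(n)$, $\lambda_{n1j} = r_n\,\mathrm{SE}(\hat\alpha_j^u)$, $\lambda_{n2k} = r_n\,\mathrm{SE}(\hat\beta_k^u)$ and

$\mathrm{BIC}(S_T) = \log\{\hat\sigma_n^2(S_T)\} + \{\log(n)/n\}\,\mathrm{DF}(S_T)$,

where $\mathrm{DF}(S_T)$ stands for the degrees of freedom of the true model $S_T$.
Note that $\mathrm{SE}(\hat\alpha_j^u) = O(1/\sqrt{n})$ and $\mathrm{SE}(\hat\beta_k^u) = O(1/\sqrt{n})$. Thus, $\lambda_{n1j} = O_P\{\log(n)/\sqrt{n}\}$
and $\lambda_{n2k} = O_P\{\log(n)/\sqrt{n}\}$. Then, employing techniques similar to those
used in Wang, Li and Tsai (2007), we obtain that

(A.11)  $P\{\mathrm{BIC}(r_n) = \mathrm{BIC}(S_T)\} \to 1$.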

Therefore, to prove the theorem, it suffices to show that

(A.12)  $P\{\inf_{\lambda \in \Omega_- \cup \Omega_+} \mathrm{BIC}(\lambda) > \mathrm{BIC}(r_n)\} \to 1$,

where

$\Omega_- = \{\lambda: S_\lambda \not\supset S_T\}$ and $\Omega_+ = \{\lambda: S_\lambda \supset S_T\}$

represent the underfitted and overfitted models, respectively.

To demonstrate (A.12), we consider two separate cases given below.

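Case 1: Underfitted model (i.e., the model misses at least one covariate
from the true model). For any $\lambda \in \Omega_-$, (A.11), together with assumptions (A)
and (B), implies that, with probability tending to one,

$\mathrm{BIC}(\lambda) - \mathrm{BIC}(r_n) = \log\{\mathrm{MSE}(\lambda)\} + \{\log(n)/n\}\,\mathrm{DF}_\lambda - \mathrm{BIC}(S_T)$
$\ge \log\{\mathrm{MSE}(\lambda)\} - \mathrm{BIC}(S_T)$
$\ge \log\{\hat\sigma_n^2(S_\lambda)\} - \mathrm{BIC}(S_T)$
$\ge \inf_{\lambda \in \Omega_-}\log\{\hat\sigma_n^2(S_\lambda)\} - \mathrm{BIC}(S_T)$
$\ge \min_{S \not\supset S_T}\log\{\hat\sigma_n^2(S)\} - \log\{\hat\sigma_n^2(S_T)\} - \{\log(n)/n\}\,\mathrm{DF}(S_T)$
$\to \min_{S \not\supset S_T}\log\{\sigma^2(S)/\sigma^2(S_T)\} > 0$.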

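Case 2: Overfitted model (i.e., the model contains all of the covariates in
the true model and includes at least one covariate that does not belong to
the true model). For any $\lambda \in \Omega_+$, it follows by (A.11) that, with probability
tending to one,

$n\{\mathrm{BIC}(\lambda) - \mathrm{BIC}(r_n)\} = n\{\mathrm{BIC}(\lambda) - \mathrm{BIC}(S_T)\}$
$= n\log\{\mathrm{MSE}(\lambda)/\hat\sigma_n^2(S_T)\} + (\mathrm{DF}_\lambda - \mathrm{DF}_{S_T})\log(n)$
$\ge n\log\{\hat\sigma_n^2(S_\lambda)/\hat\sigma_n^2(S_T)\} + (\mathrm{DF}_\lambda - \mathrm{DF}_{S_T})\log(n)$
$= \frac{n\{\hat\sigma_n^2(S_\lambda) - \hat\sigma_n^2(S_T)\}}{\hat\sigma_n^2(S_T)}\{1 + o_P(1)\} + (\mathrm{DF}_\lambda - \mathrm{DF}_{S_T})\log(n)$.

Applying the result of Theorem 4, we know that $n\{\hat\sigma_n^2(S_\lambda) - \hat\sigma_n^2(S_T)\}/\hat\sigma_n^2(S_T)$
asymptotically follows a chi-squared distribution with $\mathrm{DF}_\lambda - \mathrm{DF}_{S_T}$ degrees of
freedom. Accordingly, we obtain that $n\{\hat\sigma_n^2(S_\lambda) - \hat\sigma_n^2(S_T)\}/\hat\sigma_n^2(S_T) = O_P(1)$.
Moreover, for any $\lambda \in \Omega_+$, $\mathrm{DF}_\lambda - \mathrm{DF}_{S_T} \ge 1$, and hence $(\mathrm{DF}_\lambda - \mathrm{DF}_{S_T})\log(n)$
diverges to $+\infty$ as $n \to \infty$. Consequently,

$P\{\inf_{\lambda \in \Omega_+} n\{\mathrm{BIC}(\lambda) - \mathrm{BIC}(r_n)\} > 0\} = P\{\inf_{\lambda \in \Omega_+} \mathrm{BIC}(\lambda) - \mathrm{BIC}(r_n) > 0\} \to 1$.

The results of Cases 1 and 2 complete the proof.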
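
The trade-off in Case 2 can be made concrete with a small numerical sketch of the criterion $\log(\hat\sigma^2) + \{\log(n)/n\}\,\mathrm{DF}$. The residual variances and degrees of freedom below are hypothetical numbers chosen for illustration, and the helper name `bic` is not from the paper.

```python
import math

def bic(sigma2_hat, n, df):
    """BIC-type criterion: log(sigma2_hat) + {log(n)/n} * df."""
    return math.log(sigma2_hat) + math.log(n) / n * df

# Hypothetical values: the overfitted model has two extra parameters and
# reduces the residual variance only slightly.
n = 200
bic_true = bic(sigma2_hat=1.000, n=n, df=5)
bic_over = bic(sigma2_hat=0.995, n=n, df=7)

# n * (BIC difference) = n * log(variance ratio) + (extra df) * log(n)
gap = n * (bic_over - bic_true)
```

Here the variance term contributes a bounded negative amount while the penalty term contributes $2\log(200)$, so the gap is positive and the smaller model is preferred.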

A.4. Proof of Theorem 4. We apply techniques similar to those used in
the proofs of Theorems 3.1 and 3.2 in Fan and Huang (2005) to prove this
theorem. Accordingly, we only provide a sketch of the proof here; detailed
derivations can be obtained from the authors upon request.

Let 



Br, 



E 

,i=i 



X2' 



E 



Xi 



*2' 



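The difference $Q(H_0) - Q(H_1)$ can be expressed as

(A.13)  $\sum_{i=1}^n \{Y_i - X_i^T\hat\beta_0 - \hat\eta(Z_i^T\hat\alpha_0, \hat\zeta_0)\}^2 - \sum_{i=1}^n \{Y_i - X_i^T\hat\beta_1 - \hat\eta(Z_i^T\hat\alpha_1, \hat\zeta_1)\}^2$
$= \sum_{i=1}^n \{\hat\eta(Z_i^T\hat\alpha_1, \hat\zeta_1) - \hat\eta(Z_i^T\hat\alpha_0, \hat\zeta_0) + X_i^T(\hat\beta_1 - \hat\beta_0)\}^2$
$\quad + 2\sum_{i=1}^n \{Y_i - X_i^T\hat\beta_1 - \hat\eta(Z_i^T\hat\alpha_1, \hat\zeta_1)\}\{\hat\eta(Z_i^T\hat\alpha_1, \hat\zeta_1) - \hat\eta(Z_i^T\hat\alpha_0, \hat\zeta_0) + X_i^T(\hat\beta_1 - \hat\beta_0)\}$
$=: Q_1 + Q_2$.

It can be shown that $Q_2$ is asymptotically negligible in probability. Furthermore,
$Q_1$ can be simplified as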

<?> = - c„) T E { f (<i - c„) + *<i). 

i=l 1 Ai J 

A direct calculation yields that £ = Ci — -BnCi- This, together with the 
asymptotic normality and consistency of $\hat\zeta_1$ obtained from Theorem 1, implies
that $\sigma^{-2}Q_1 \to \chi^2_m$ in distribution under $H_0$. Moreover, under $H_1$,
$\sigma^{-2}Q_1$ asymptotically follows a noncentral chi-squared distribution with $m$
degrees of freedom and noncentrality parameter $\phi$. This completes the proof.
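
The central and noncentral chi-squared limits translate directly into size and power. The Monte Carlo sketch below is illustrative only: it takes $m = 4$, uses the 0.95 quantile of the chi-squared distribution with 4 degrees of freedom (approximately 9.488), and places the whole noncentrality in one coordinate; the function name `simulated_power` and the values of $\phi$ are hypothetical.

```python
import math
import random

CRIT_CHI2_4_95 = 9.488  # 0.95 quantile of chi-squared with 4 degrees of freedom

def simulated_power(phi, m=4, crit=CRIT_CHI2_4_95, n_rep=20000, seed=1):
    """Rejection rate of a chi-squared test when the statistic is
    noncentral chi-squared with m degrees of freedom and noncentrality phi."""
    rng = random.Random(seed)
    shift = math.sqrt(phi)
    rejections = 0
    for _ in range(n_rep):
        # Sum of m squared normals; only the first one has a nonzero mean.
        stat = (rng.gauss(0.0, 1.0) + shift) ** 2
        stat += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m - 1))
        if stat > crit:
            rejections += 1
    return rejections / n_rep

size = simulated_power(0.0)    # phi = 0 corresponds to H0
power = simulated_power(10.0)  # a hypothetical alternative
```

With $\phi = 0$ the rejection rate should be near the nominal 0.05, and it increases with $\phi$, mirroring the shape of the power curves in Figure 1.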

A.5. Proof of Theorem 5. It is noteworthy that

n 

RSS(#o) = ^{Yi - Xj(3 - v(Zlcc)} 2 



+ ^[{y, - Xjfi - f,{Zja.)} 2 - {Y t - X?/3 - v{Zja.)} 2 ] 
i=i 

cf RSS*(iIo) + InO 



and 

n 

rss(^i) = J2i Y i- x ?P-fi(z?&)¥ 



i=i 



+ ^[{y 4 - Xjji - f)(Z?&)} 2 - {Y t - Xj(3 - ii(Zja.)} 2 } 
i=i 

= RSS*(Hi)+I nl . 

Applying arguments similar to those in the proof of Theorem 5 in Fan, Zhang
and Zhang (2001), under $H_0$, we have

nr K RSS*(# )-RSS*(fli) 2 



RSS*(ifi 
where $df_n$ is defined as in Theorem 5 and it approaches infinity as $n \to \infty$.
Furthermore, it can be straightforwardly shown that $n^{-1}I_{n0} = \sigma^2\{1 + o_P(1)\}$
and $n^{-1}I_{n1} = \sigma^2\{1 + o_P(1)\}$. These results complete the proof.